Abstract

Regularization is a powerful technique for extracting useful information from noisy data. 
Typically, it is implemented by adding some sort of norm constraint to an objective function 
and then exactly optimizing the modified objective function. This procedure often leads to 
optimization problems that are computationally more expensive than the original problem, a 
fact that is clearly problematic if one is interested in large-scale applications. On the other 
hand, a large body of empirical work has demonstrated that heuristics, and in some cases 
approximation algorithms, developed to speed up computations sometimes have the side- 
effect of performing regularization implicitly. Thus, we consider the question: What is the 
regularized optimization objective that an approximation algorithm is exactly optimizing? 

We address this question in the context of computing approximations to the smallest nontrivial
eigenvector of a graph Laplacian; and we consider three random-walk-based procedures:
one based on the heat kernel of the graph, one based on computing the PageRank vector
associated with the graph, and one based on a truncated lazy random walk. In each case, we
provide a precise characterization of the manner in which the approximation method can be
viewed as implicitly computing the exact solution to a regularized problem. Interestingly, the
regularization is not on the usual vector form of the optimization problem, but instead it is
on a related semidefinite program.

1 Introduction 

Regularization is a powerful technique in statistics, machine learning, and data analysis for learning
from or extracting useful information from noisy data [14, 6, 4]. It involves (explicitly or
implicitly) making assumptions about the data in order to obtain a "smoother" or "nicer" solution
to a problem of interest. The technique originated in integral equation theory, where it
was of interest to give meaningful solutions to ill-posed problems for which a solution did not 
exist [22]. More recently, it has achieved widespread use in statistical data analysis, where it 
is of interest to achieve solutions that generalize well to unseen data [9]. For instance, much 
of the work in kernel-based and manifold-based machine learning is based on regularization in 
reproducing kernel Hilbert spaces [19].

Typically, regularization is implemented via a two-step process: first, add some sort of norm
constraint to an objective function of interest; and then, exactly optimize the modified objective 
function. For instance, one typically considers a loss function f(x) that specifies an empirical 
penalty depending on both the data and a parameter vector x; and a regularization function 
g(x) that encodes prior assumptions about the data and that provides capacity control on the 
vector x. Then, one must solve an optimization problem of the form: 

$$\hat{x} = \operatorname{argmin}_x \; f(x) + \lambda g(x). \tag{1}$$
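As a concrete instance of Problem (1), the following minimal sketch takes the loss to be a least-squares residual and the regularizer to be a squared 2-norm (ridge regression). The names `ridge_solution`, `ridge_gradient`, `B`, `y`, and `lam` are illustrative assumptions, not notation from the text, and the data are synthetic.

```python
import numpy as np

def ridge_solution(B, y, lam):
    """Exact minimizer of ||B x - y||_2^2 + lam * ||x||_2^2 (closed form)."""
    n_features = B.shape[1]
    return np.linalg.solve(B.T @ B + lam * np.eye(n_features), B.T @ y)

def ridge_gradient(B, y, lam, x):
    """Gradient of the regularized objective at x; it vanishes at the minimizer."""
    return 2.0 * B.T @ (B @ x - y) + 2.0 * lam * x

# Synthetic data, chosen only for illustration.
rng = np.random.default_rng(0)
B = rng.standard_normal((20, 5))
y = rng.standard_normal(20)

x_small = ridge_solution(B, y, lam=0.1)
x_large = ridge_solution(B, y, lam=10.0)
print(np.linalg.norm(x_small), np.linalg.norm(x_large))
```

A larger value of the regularization parameter shrinks the norm of the solution, which is the capacity-control effect described above.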
A general feature of regularization implemented in this manner is that, although one obtains
solutions that are "better" (in some statistical sense) than the solution to the original problem,
one must often solve a modified optimization problem that is "worse" (in the sense of being
more computationally expensive) than the original optimization problem.¹ Clearly, this
algorithmic-statistical tradeoff is problematic if one is interested in large-scale applications.

On the other hand, it is well-known amongst practitioners that certain heuristics that can 
be used to speed up computations can sometimes have the side-effect of performing smoothing 
or regularization implicitly. For example, "early stopping" is often used when a learning model 
such as a neural network is trained by an iterative gradient descent algorithm; and "binning" 
is often used to aggregate the data into bins, upon which computations are performed. As we 
will discuss below, we have also observed a similar phenomenon in the empirical analysis of very 
large social and information networks [12]. In these applications, the size-scale of the networks 
renders prohibitive anything but very fast nearly-linear-time algorithms, but the sparsity and
noise properties of the networks are sufficiently complex that there is a need to understand the 
statistical properties implicit in these fast algorithms in order to draw meaningful domain-specific 
conclusions from their output. 

Motivated by these observations, we are interested in understanding in greater detail the 
manner in which algorithms that have superior algorithmic and computational properties either 
do or do not also have superior statistical properties. In particular, we would like to know: 

• To what extent can one formalize the idea that performing an approximate computation 
can implicitly lead to more regular solutions? 

Rather than addressing this question in full generality, in this paper we will address it in the 
context of computing the first nontrivial eigenvector of the graph Laplacian. (Of course, even this 
special case is of interest since a large body of work in machine learning, data analysis, computer 
vision, and scientific computation makes use of this vector.) Our main result is a characterization 
of this implicit regularization in the context of three random-walk-based procedures for computing
an approximation to this eigenvector. In particular: 

• We consider three random-walk-based procedures (one based on the heat kernel of the
graph, one based on computing the PageRank vector associated with the graph, and
one based on a truncated lazy random walk) for computing an approximation to the smallest
nontrivial eigenvector of a graph Laplacian, and we show that these approximation
procedures may be viewed as implicitly solving a regularized optimization problem exactly.

Interestingly, in order to achieve this identification, we need to relax the standard spectral optimization
problem to a semidefinite program. Thus, the variables that enter into the loss function
and the regularization term are not unit vectors, as they are more typically in formulations such as
Problem (1), but instead they are distributions over unit vectors. This was somewhat unexpected,
and the empirical implications of this remain to be explored.

Before proceeding, let us pause to gain an intuition of our results in a relatively simple setting.
To do so, consider the so-called Power Iteration Method, which takes as input an $n \times n$ symmetric
matrix $A$ and returns as output a number $\lambda$ (the eigenvalue) and a vector $v$ (the eigenvector)
such that $Av = \lambda v$.² The Power Iteration Method starts with an initial random vector, call it $\nu_0$,
and it iteratively computes $\nu_{t+1} = A\nu_t / \|A\nu_t\|_2$. Under weak assumptions, the method converges
to $v_1$, the dominant eigenvector of $A$. The reason is clear: if we expand $\nu_0 = \sum_{i=1}^n c_i v_i$ in the
basis provided by the eigenfunctions $\{v_i\}_{i=1}^n$ of $A$, then, up to normalization,
$\nu_t = \sum_{i=1}^n c_i \lambda_i^t v_i \to v_1$, where $\lambda_i$ is the eigenvalue associated with $v_i$. If we truncate
this method after some very small number, say 3, iterations, then the output vector is clearly a
suboptimal approximation of the dominant eigen-direction of the particular matrix $A$; but due
to the admixing of information from the other eigenvectors, it may be a better or more robust
approximation to the best "ground truth eigen-direction" in the ensemble from which $A$ was
drawn. It is this intuition in the context of computing eigenvectors of the graph Laplacian that
our main results formalize.

¹ Think of ridge regression or the $\ell_1$-regularized $\ell_2$-regression problem. More generally, however, even assuming
that $g(x)$ is convex, one obtains a linear program or convex program that must be solved.

² Our result for the truncated lazy random walk generalizes a special case of the Power Method. Formalizing
the regularization implicit in the Power Method more generally, or in other methods such as the Lanczos method
or the Conjugate Gradient method, is technically more intricate due to the renormalization at each step, which by
construction we will not need.
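The Power Iteration intuition above can be sketched numerically as follows. This is a minimal illustration under stated assumptions: the test matrix, its spectrum, and the helper names `power_iteration` and `alignment` are hypothetical and do not come from the text.

```python
import numpy as np

def power_iteration(A, num_iters, seed=0):
    """Run num_iters steps of nu <- A nu / ||A nu||_2 from a random start."""
    rng = np.random.default_rng(seed)
    nu = rng.standard_normal(A.shape[0])
    nu /= np.linalg.norm(nu)
    for _ in range(num_iters):
        w = A @ nu
        nu = w / np.linalg.norm(w)
    return nu

def alignment(u, v):
    """Absolute cosine between two vectors (1.0 means equal up to sign)."""
    return abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical symmetric matrix with a known spectrum (illustration only).
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
eigenvalues = np.array([3.0, 2.0, 1.0, 0.5, 0.2, 0.1])
A = Q @ np.diag(eigenvalues) @ Q.T
v1 = Q[:, 0]  # dominant eigenvector of A

truncated = power_iteration(A, num_iters=3)    # early-stopped output
converged = power_iteration(A, num_iters=200)  # essentially exact output
print(alignment(truncated, v1), alignment(converged, v1))
```

The truncated output still carries weight on the non-dominant eigenvectors, which is the "admixing" that the paragraph above refers to.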

2 Overview of the problem and approximation procedures 

For a connected, weighted, undirected graph $G = (V,E)$, let $A$ be its adjacency matrix and $D$
its diagonal degree matrix, i.e., $D_{ii} = \sum_{j:(ij)\in E} w_{ij}$, where $w_{ij}$ is the weight of edge $(ij)$. Let
$M = AD^{-1}$ be the natural random walk transition matrix associated with $G$, in which case
$W = (I + M)/2$ is the usual lazy random walk transition matrix. (Thus, we will be post-multiplying
by column vectors.) Finally, let $L = I - D^{-1/2} A D^{-1/2}$ be the normalized Laplacian
of $G$.

We start by considering the standard spectral optimization problem. 

$$\begin{aligned}
\text{SPECTRAL}: \quad \min \;& x^T L x \\
\text{s.t.} \;& x^T x = 1 \\
& x^T D^{1/2} \mathbf{1} = 0.
\end{aligned}$$

In the remainder of the paper, we will assume that this last constraint always holds, effectively
limiting ourselves to be in the subspace $\mathbb{R}^n \perp D^{1/2}\mathbf{1}$, by which we mean $\{x \in \mathbb{R}^n : x^T D^{1/2}\mathbf{1} = 0\}$.
(Omitting explicit reference to this orthogonality constraint and assuming that we are always
working in the subspace $\mathbb{R}^n \perp D^{1/2}\mathbf{1}$ makes the statements and the proofs easier to follow and does
not impact the correctness of the arguments. To check this, notice that the proofs can be carried
out in the language of linear operators without any reference to a particular matrix representation
in $\mathbb{R}^n$.)
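The definitions above, together with SPECTRAL, can be checked on a small graph. The following is a minimal sketch under stated assumptions: the example graph (two triangles joined by one edge) and the helper names `graph_matrices` and `spectral_solution` are hypothetical, and the exact solution is obtained from a dense eigendecomposition rather than from any method discussed in the text.

```python
import numpy as np

def graph_matrices(A):
    """Return (d, M, W, L) for a symmetric weighted adjacency matrix A."""
    d = A.sum(axis=0)                              # weighted degrees D_ii
    n = A.shape[0]
    M = A @ np.diag(1.0 / d)                       # M = A D^{-1}, column-stochastic
    W = 0.5 * (np.eye(n) + M)                      # lazy walk W = (I + M) / 2
    d_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(n) - d_inv_sqrt @ A @ d_inv_sqrt    # normalized Laplacian
    return d, M, W, L

def spectral_solution(A):
    """Solve SPECTRAL exactly: the smallest nontrivial eigenpair of L."""
    _, _, _, L = graph_matrices(A)
    evals, evecs = np.linalg.eigh(L)               # eigenvalues in ascending order
    return evals[1], evecs[:, 1]

# Hypothetical example: two triangles joined by a single edge.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

d, M, W, L = graph_matrices(A)
value, x = spectral_solution(A)
print(value, np.sign(x))
```

For a connected graph the eigenvalue zero of $L$ is simple with eigenvector proportional to $D^{1/2}\mathbf{1}$, so the second column returned by `eigh` automatically satisfies the orthogonality constraint.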

Next, we provide a description of three related random-walk-based matrices that arise naturally
when considering a graph $G$.

• Heat Kernel. The Heat Kernel of a connected, undirected graph $G$ can be defined as:

$$H_t = \exp(-tL) = \sum_{k=0}^{\infty} \frac{(-t)^k}{k!} L^k, \tag{2}$$

where $t > 0$ is a time parameter. Alternatively, it can be written as $H_t = \sum_i e^{-\lambda_i t} P_i$, where
$\lambda_i$ is the $i$-th eigenvalue of $L$ and $P_i$ denotes the projection into the eigenspace associated
with $\lambda_i$. The Heat Kernel is an operator that satisfies the heat equation $\frac{\partial H_t}{\partial t} = -L H_t$ and
thus that describes the diffusive spreading of heat on the graph.

• PageRank. The PageRank vector $\pi(\gamma, s)$ associated with a connected, undirected graph
$G$ is defined to be the unique solution to

$$\pi(\gamma, s) = \gamma s + (1 - \gamma) M \pi(\gamma, s),$$

where $\gamma \in (0, 1)$ is the so-called teleportation constant; $s \in \mathbb{R}^n$ is a preference vector,
often taken to be (up to normalization) the all-ones vector; and $M$ is the natural random
walk matrix associated with $G$ [10].³ If we fix $\gamma$ and $s$, then it is known that
$\pi(\gamma, s) = \gamma \sum_{t=0}^{\infty} (1 - \gamma)^t M^t s$, and thus that $\pi(\gamma, s) = R_\gamma s$, where

$$R_\gamma = \gamma \left(I - (1 - \gamma) M\right)^{-1}. \tag{3}$$

This provides an expression for the PageRank vector $\pi(\gamma, s)$ as a $\gamma$-dependent linear transformation
matrix $R_\gamma$ multiplied by the preference vector $s$. That is, Eqn. (3) simply
states that PageRank can be presented as a linear operator $R_\gamma$ acting on the seed $s$.

• Truncated Lazy Random Walk. Since $M = AD^{-1}$ is the natural random walk transition
matrix associated with a connected, undirected graph $G$, it follows that

$$W_\alpha = \alpha I + (1 - \alpha) M \tag{4}$$

represents one step of the $\alpha$-lazy random walk transition matrix, in which at each step
there is a holding probability $\alpha \in [0, 1]$. Just as $M$ is similar to $M' = D^{-1/2} M D^{1/2}$, which
permits the computation of its real eigenvalues and full suite of eigenvectors that can be
related to those of $M$, $W_\alpha$ is similar to $W'_\alpha = D^{-1/2} W_\alpha D^{1/2}$. Thus, iterating the random
walk $W_\alpha$ is similar to applying the Power Method to $W'_\alpha$, except that the renormalization
at each step need not be performed since the top eigenvalue is unity.

Each of these three matrices has been used to compute vectors that in applications are then used 
in place of the smallest nontrivial eigenvector of a graph Laplacian. This is typically achieved 
by starting with an initial random vector and then applying the Heat Kernel matrix, or the 
PageRank operator, or truncating a Lazy Random Walk. 
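The three operators can be sketched with dense linear algebra as follows. This is an illustration under stated assumptions, not an efficient implementation: the four-node path graph, the seed vector, the parameter values, and the function names `heat_kernel`, `pagerank_operator`, and `lazy_walk` are all hypothetical choices, and the matrix exponential is computed from an eigendecomposition of the symmetric normalized Laplacian.

```python
import numpy as np

def heat_kernel(L, t):
    """H_t = exp(-t L), via the eigendecomposition of the symmetric matrix L."""
    evals, evecs = np.linalg.eigh(L)
    return (evecs * np.exp(-t * evals)) @ evecs.T

def pagerank_operator(M, gamma):
    """R_gamma = gamma (I - (1 - gamma) M)^{-1}."""
    n = M.shape[0]
    return gamma * np.linalg.inv(np.eye(n) - (1.0 - gamma) * M)

def lazy_walk(M, alpha, steps):
    """W_alpha^steps, where W_alpha = alpha I + (1 - alpha) M."""
    n = M.shape[0]
    W_alpha = alpha * np.eye(n) + (1.0 - alpha) * M
    return np.linalg.matrix_power(W_alpha, steps)

# Hypothetical example graph: a path on four nodes.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d = A.sum(axis=0)
M = A / d                                     # A D^{-1}: column j divided by d_j
L = np.eye(4) - A / np.sqrt(np.outer(d, d))   # I - D^{-1/2} A D^{-1/2}

s = np.array([1.0, 0.0, 0.0, 0.0])            # seed concentrated on node 0
gamma = 0.15
pi = pagerank_operator(M, gamma) @ s
heat = heat_kernel(L, t=2.0) @ s
walk = lazy_walk(M, alpha=0.5, steps=3) @ s
print(pi, heat, walk)
```

In large-scale settings one would apply these operators to a vector iteratively rather than forming them explicitly; the dense form is used here only to make the definitions concrete.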

Finally, we recall that the solution to SPECTRAL can also be characterized as the solution to a
semidefinite program (SDP). To see this, consider the following SDP:

$$\begin{aligned}
\text{SDP}: \quad \min \;& L \bullet X \\
\text{s.t.} \;& \operatorname{Tr}(X) = I \bullet X = 1 \\
& X \succeq 0,
\end{aligned}$$

where $\bullet$ stands for the Trace, or matrix inner product, operation, i.e., $A \bullet B = \operatorname{Tr}(AB^T) =
\sum_{ij} A_{ij} B_{ij}$ for matrices $A$ and $B$. (Recall that, both here and below, $I$ is the Identity on the
subspace perpendicular to the all-ones vector.) SDP is a relaxation of the spectral program
SPECTRAL from an optimization over unit vectors to an optimization over distributions over
unit vectors, represented by the density matrix $X$.

To see the relationship between the solution $x$ of SPECTRAL and the solution $X$ of SDP, recall
that a density matrix $X$ is a matrix of second moments of a distribution over unit vectors. In this
case, $L \bullet X$ is the expected value of $x^T L x$, when $x$ is drawn from a distribution defined by $X$. If $X$
is rank-1, as is the case for the solution to SDP, then the distribution is completely concentrated
on a vector $v$, and the SDP and vector solutions are the same, in the sense that $X = vv^T$. More
generally, as we will encounter below, the solution to an SDP may not be rank-1. In that case, a
simple way to construct a vector $x$ from a distribution defined by $X$ is to start with an $n$-vector $\xi$
with entries drawn i.i.d. from the normal distribution $N(0, 1/n)$, and consider $x = X^{1/2}\xi$. Note
that this procedure effectively samples from a Gaussian distribution with second moment $X$.
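The sampling procedure just described can be sketched as follows. The density matrix, the test Laplacian, and the names `psd_sqrt` and `sample_from_density` are hypothetical choices made for illustration. One point worth making explicit: with entries of variance $1/n$, the empirical second moment of $x$ equals $X/n$, so the sketch rescales by $n$ before comparing with $X$.

```python
import numpy as np

def psd_sqrt(X):
    """Symmetric square root of a positive semidefinite matrix."""
    evals, evecs = np.linalg.eigh(X)
    evals = np.clip(evals, 0.0, None)         # guard against tiny negative round-off
    return (evecs * np.sqrt(evals)) @ evecs.T

def sample_from_density(X, num_samples, seed=0):
    """Columns are x = X^{1/2} xi, with xi having i.i.d. N(0, 1/n) entries."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    xi = rng.normal(0.0, np.sqrt(1.0 / n), size=(n, num_samples))
    return psd_sqrt(X) @ xi

# Hypothetical rank-2 density matrix (unit trace, positive semidefinite).
u = np.array([1.0, 1.0, 0.0, 0.0]) / np.sqrt(2.0)
v = np.array([0.0, 0.0, 1.0, -1.0]) / np.sqrt(2.0)
X = 0.7 * np.outer(u, u) + 0.3 * np.outer(v, v)

# Normalized Laplacian of a four-node path, used only as a test matrix.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
d = A.sum(axis=0)
L = np.eye(4) - A / np.sqrt(np.outer(d, d))

samples = sample_from_density(X, num_samples=200_000)
n = X.shape[0]
second_moment = n * (samples @ samples.T) / samples.shape[1]
mean_quadratic = n * np.mean(np.einsum("ik,ij,jk->k", samples, L, samples))
print(mean_quadratic, np.trace(L @ X))
```

Up to the factor $n$, the Monte Carlo average of $x^T L x$ agrees with $L \bullet X$, which is the sense in which $X$ represents a distribution over vectors.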

³ Alternatively, one can define $\pi'(\gamma, s)$ to be the unique solution to $\pi = \gamma s + (1 - \gamma) W \pi$, where $W$ is the 1/2-lazy
random walk matrix associated with $G$. These two vectors are related as $\pi'(\gamma, s) = \pi\left(\frac{2\gamma}{1+\gamma}, s\right)$ [1].
3 Approximation procedures and regularized spectral optimization problems



3.1 A simple theorem characterizing the solution to a regularized SDP 

Here, we will apply a regularization technique to the SDP formulation provided by SDP, and we
will show how natural regularization functions yield distributions over vectors which correspond
to the diffusion-based or random-walk-based matrices. In order to regularize SDP, we want to
modify it such that the distribution is not degenerate on the second eigenvector, but instead
spreads the probability on a larger set of unit vectors around $v$. The regularized version of SDP
we will consider will be of the form:

$$\begin{aligned}
(F,\eta)\text{-SDP}: \quad \min \;& L \bullet X + \frac{1}{\eta} \cdot F(X) \\
\text{s.t.} \;& I \bullet X = 1 \\
& X \succeq 0,
\end{aligned}$$

where $\eta > 0$ is a trade-off or regularization parameter determining the relative importance
of the regularization term $F(X)$, and where $F$ is a real strictly-convex infinitely-differentiable
rotationally-invariant function over the positive semidefinite cone. (Think of $F$ as a strictly convex
function of the eigenvalues of $X$.) For example, $F$ could be the negative of the von Neumann
entropy of $X$; this would penalize distributions that are too concentrated on a small measure of
vectors. We will consider other possibilities for $F$ below. Note that due to $F$, the solution $X$ of
$(F, \eta)$-SDP will in general not be rank-1.

Our main results on implicit regularization via approximate computation will be based on the
following structural theorem that provides sufficient conditions for a matrix to be a solution of
a regularized SDP of a certain form. Note that the Lagrangian parameter $\lambda$ and its relationship
with the regularization parameter $\eta$ will play a key role in relating this structural theorem to the
three random-walk-based procedures described previously.

Theorem 1 Let $G$ be a connected, weighted, undirected graph, with normalized Laplacian $L$.
Then, the following conditions are sufficient for $X^*$ to be an optimal solution to $(F, \eta)$-SDP.

1. $X^* = (\nabla F)^{-1}\left(\eta \cdot (\lambda^* I - L)\right)$, for some $\lambda^* \in \mathbb{R}$,

2. $I \bullet X^* = 1$,

3. $X^* \succeq 0$.

Proof: For a general function $F$, we can write the Lagrangian $\mathcal{L}$ for $(F, \eta)$-SDP as follows:

$$\mathcal{L}(X, \lambda, U) = L \bullet X + \frac{1}{\eta} \cdot F(X) - \lambda \cdot (I \bullet X - 1) - U \bullet X,$$

where $\lambda \in \mathbb{R}$, $U \succeq 0$. The dual objective function is

$$h(\lambda, U) = \min_X \mathcal{L}(X, \lambda, U).$$

As $F$ is strictly convex, differentiable and rotationally invariant, the gradient of $F$ over the positive
semidefinite cone is invertible and the right-hand side is minimized when

$$X = (\nabla F)^{-1}\left(\eta \cdot (-L + \lambda^* \cdot I + U)\right),$$

where $\lambda^*$ is chosen such that the second condition in the statement of the theorem is satisfied.
Hence,

$$h(\lambda^*, 0) = L \bullet X^* + \frac{1}{\eta} \cdot F(X^*) - \lambda^* \cdot (I \bullet X^* - 1) = L \bullet X^* + \frac{1}{\eta} \cdot F(X^*).$$

By Weak Duality, this implies that $X^*$ is an optimal solution to $(F, \eta)$-SDP.

□

Two clarifying remarks regarding this theorem are in order. First, the fact that such a $\lambda^*$ exists
is an assumption of the theorem. Thus, in fact, the theorem is just a statement of the KKT
conditions and strong duality holds for our SDP formulations. For simplicity and to keep the
exposition self-contained, we decided to present the proof of optimality, which is extremely easy
in the case of an SDP with only linear constraints. Second, we can plug the dual solution $(\lambda^*, 0)$
into the dual objective and show that, under the assumptions of the theorem, we obtain a value
equal to the primal value of $X^*$. This certifies that $X^*$ is optimal. Thus, we do not need to
assume $U = 0$; we just choose to plug in this particular dual solution.



3.2 The connection between approximate eigenvector computation and implicit statistical regularization

In this section, we will consider the three diffusion-based or random-walk-based heuristics
described in Section 2, and we will show that each may be viewed as solving $(F, \eta)$-SDP for an
appropriate value of $F$ and $\eta$.

Generalized Entropy and the Heat Kernel. Consider first the Generalized Entropy function:

$$F_H(X) = \operatorname{Tr}(X \log X) - \operatorname{Tr}(X), \tag{5}$$

for which:

$$(\nabla F_H)(X) = \log X, \qquad (\nabla F_H)^{-1}(Y) = \exp Y.$$

Hence, the solution to $(F_H, \eta)$-SDP has the form:

$$X^*_H = \exp\left(\eta \cdot (\lambda I - L)\right), \tag{6}$$

for appropriately-chosen values of $\lambda$ and $\eta$. Thus, we can establish the following lemma.

Lemma 1 Let $X^*_H$ be an optimal solution to $(F, \eta)$-SDP, when $F(\cdot)$ is the Generalized Entropy
function, given by Equation (5). Then

$$X^*_H = \frac{H_\eta}{\operatorname{Tr}[H_\eta]},$$

which corresponds to a "scaled" version of the Heat Kernel matrix with time parameter $t = \eta$.

Proof: From Equation (6), it follows that $X^*_H = \exp(-\eta \cdot L) \cdot \exp(\eta \cdot \lambda)$, and thus by setting
$\lambda = -\frac{1}{\eta} \log\left(\operatorname{Tr}(\exp(-\eta \cdot L))\right)$, we obtain the expression for $X^*_H$ given in the lemma. Thus,
$X^*_H \succeq 0$ and $\operatorname{Tr}(X^*_H) = 1$, and by Theorem 1 the lemma follows.

□

Conversely, given a graph $G$ and time parameter $t$, the Heat Kernel of Equation (2) can be
characterized as the solution to the regularized $(F_H, \eta)$-SDP, with the regularization parameter
$\eta = t$ (and for the value of the Lagrangian parameter $\lambda$ as specified in the proof).
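The correspondence between the scaled Heat Kernel and the entropy-regularized SDP can be checked numerically. The following minimal sketch makes two simplifying assumptions that should be read as such: it works on all of $\mathbb{R}^n$ rather than on the subspace orthogonal to $D^{1/2}\mathbf{1}$ (the algebra is identical, with the trace taken over the full space), and it compares the candidate solution only against randomly drawn feasible points. The graph and all function names are hypothetical.

```python
import numpy as np

def heat_kernel_solution(L, eta):
    """Return X = H_eta / Tr[H_eta] and the multiplier lam used in the proof."""
    evals, evecs = np.linalg.eigh(L)
    H = (evecs * np.exp(-eta * evals)) @ evecs.T   # H_eta = exp(-eta L)
    lam = -np.log(np.trace(H)) / eta
    return H / np.trace(H), lam

def entropy_objective(L, X, eta):
    """L . X + (1/eta) (Tr(X log X) - Tr(X)), for positive definite X."""
    w = np.linalg.eigvalsh(X)
    return np.trace(L @ X) + (np.sum(w * np.log(w)) - np.sum(w)) / eta

def random_density(n, rng):
    """A random positive definite matrix with unit trace (a feasible point)."""
    B = rng.standard_normal((n, n))
    Y = B @ B.T + 1e-3 * np.eye(n)
    return Y / np.trace(Y)

# Hypothetical example graph: a path on four nodes.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
d = A.sum(axis=0)
L = np.eye(4) - A / np.sqrt(np.outer(d, d))

eta = 1.5
X_star, lam = heat_kernel_solution(L, eta)

# Condition 1 of Theorem 1 with grad F_H = log: log X* = eta (lam I - L).
w, V = np.linalg.eigh(X_star)
log_X = (V * np.log(w)) @ V.T

rng = np.random.default_rng(0)
best_random = min(entropy_objective(L, random_density(4, rng), eta) for _ in range(200))
print(entropy_objective(L, X_star, eta), best_random)
```

No randomly drawn density matrix attains a smaller objective value than the scaled Heat Kernel, as expected for the minimizer of a strictly convex problem.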
Log-determinant and PageRank. Next, consider the Log-determinant function:

$$F_D(X) = -\log\det(X), \tag{7}$$

for which:

$$(\nabla F_D)(X) = -X^{-1}, \qquad (\nabla F_D)^{-1}(Y) = -Y^{-1}.$$

Hence, the solution to $(F_D, \eta)$-SDP has the form:

$$X^*_D = -\left(\eta \cdot (\lambda I - L)\right)^{-1}, \tag{8}$$

for appropriately-chosen values of $\lambda$ and $\eta$. Thus, we can establish the following lemma.

Lemma 2 Let $X^*_D$ be an optimal solution to $(F, \eta)$-SDP, when $F(\cdot)$ is the Log-determinant
function, given by Equation (7). Then

$$X^*_D = \frac{D^{-1/2} R_\gamma D^{1/2}}{\operatorname{Tr}[R_\gamma]},$$

which corresponds to a "scaled-and-stretched" version of the PageRank matrix $R_\gamma$ of Equation (3)
with teleportation parameter $\gamma$ depending on $\eta$.

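Proof: Recall that $L = I - D^{-1/2} A D^{-1/2}$. Since $X^*_D = \frac{1}{\eta} \cdot (L - \lambda I)^{-1}$, by standard manipulations
it follows that

$$X^*_D = \frac{1}{\eta} \left((1 - \lambda) I - D^{-1/2} A D^{-1/2}\right)^{-1}.$$

Thus, $X^*_D \succeq 0$ if $\lambda \le 0$, and $X^*_D \succ 0$ if $\lambda < 0$. If we set $\gamma = \frac{\lambda}{\lambda - 1}$ (which varies from 1 to 0, as $\lambda$
varies from $-\infty$ to 0), then it can be shown that

$$X^*_D = -\frac{1}{\eta\lambda} D^{-1/2} \, \gamma \left(I - (1 - \gamma) A D^{-1}\right)^{-1} D^{1/2}.$$

By requiring that $1 = \operatorname{Tr}[X^*_D]$, it follows that

$$\eta = (1 - \gamma) \operatorname{Tr}\left[\left(I - (1 - \gamma) A D^{-1}\right)^{-1}\right]$$

and thus that $\eta\lambda = -\operatorname{Tr}\left[\gamma \left(I - (1 - \gamma) A D^{-1}\right)^{-1}\right]$. Since $R_\gamma = \gamma \left(I - (1 - \gamma) A D^{-1}\right)^{-1}$, the lemma
follows.

□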

Conversely, given a graph $G$ and teleportation parameter $\gamma$, the PageRank of Equation (3) can be
characterized as the solution to the regularized $(F_D, \eta)$-SDP, with the regularization parameter
$\eta$ as specified in the proof.
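The parameter correspondence in the proof can be verified numerically. The sketch below is an illustration under stated assumptions: it works on all of $\mathbb{R}^n$ instead of the subspace orthogonal to $D^{1/2}\mathbf{1}$, and the graph, the value of the teleportation parameter, and the function name `pagerank_sdp_solution` are hypothetical.

```python
import numpy as np

def pagerank_sdp_solution(A, gamma):
    """Return (X, eta, lam) built from R_gamma, following the proof of Lemma 2."""
    n = A.shape[0]
    d = A.sum(axis=0)
    M = A @ np.diag(1.0 / d)                          # M = A D^{-1}
    K = np.linalg.inv(np.eye(n) - (1.0 - gamma) * M)
    R = gamma * K                                     # R_gamma of Equation (3)
    eta = (1.0 - gamma) * np.trace(K)                 # from Tr[X] = 1
    lam = -np.trace(R) / eta                          # from eta * lam = -Tr[R_gamma]
    X = np.diag(d ** -0.5) @ R @ np.diag(d ** 0.5) / np.trace(R)
    return X, eta, lam

# Hypothetical example graph: a path on four nodes.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
d = A.sum(axis=0)
L = np.eye(4) - A / np.sqrt(np.outer(d, d))

gamma = 0.2
X, eta, lam = pagerank_sdp_solution(A, gamma)

# Equation (8): X = -(eta (lam I - L))^{-1} = (1/eta) (L - lam I)^{-1}.
X_from_sdp = np.linalg.inv(L - lam * np.eye(4)) / eta
print(eta, lam, lam / (lam - 1.0))
```

The last printed value recovers the teleportation parameter from the Lagrangian parameter, matching the substitution used in the proof.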

Standard p-norm and Truncated Lazy Random Walks. Finally, consider the Standard
$p$-norm function, for $p > 1$:

$$F_p(X) = \frac{1}{p} \|X\|_p^p = \frac{1}{p} \operatorname{Tr}(X^p), \tag{9}$$

for which:

$$(\nabla F_p)(X) = X^{p-1}, \qquad (\nabla F_p)^{-1}(Y) = Y^{1/(p-1)}.$$
Hence, the solution to $(F_p, \eta)$-SDP has the form

$$X^*_p = \left(\eta \cdot (\lambda I - L)\right)^{q-1}, \tag{10}$$

where $q > 1$ is such that $1/p + 1/q = 1$, for appropriately-chosen values of $\lambda$ and $\eta$. Thus, we can
establish the following lemma.

Lemma 3 Let $X^*_p$ be an optimal solution to $(F, \eta)$-SDP, when $F(\cdot)$ is the Standard $p$-norm
function, for $p > 1$, given by Equation (9). Then

$$X^*_p = \frac{D^{-1/2} W_\alpha^{q-1} D^{1/2}}{\operatorname{Tr}\left[W_\alpha^{q-1}\right]},$$

which corresponds to a "scaled-and-stretched" version of $q - 1$ steps of the Truncated Lazy Random
Walk matrix $W_\alpha$ of Equation (4) with laziness parameter $\alpha$ depending on $\eta$.

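Proof: Recall that $L = I - D^{-1/2} A D^{-1/2}$. Since $X^*_p = \eta^{q-1} \cdot (\lambda I - L)^{q-1}$, by standard manipulations
it follows that

$$X^*_p = \eta^{q-1} \left((\lambda - 1) I + D^{-1/2} A D^{-1/2}\right)^{q-1}.$$

Thus, $X^*_p \succeq 0$ if $\lambda \ge 1$, and $X^*_p \succ 0$ if $\lambda > 1$. If we set $\alpha = \frac{\lambda - 1}{\lambda}$ (which varies from 0 to 1, as $\lambda$
varies from 1 to $\infty$), then it can be shown that

$$X^*_p = (\eta\lambda)^{q-1} D^{-1/2} \left(\alpha I + (1 - \alpha) A D^{-1}\right)^{q-1} D^{1/2}.$$

By requiring that $1 = \operatorname{Tr}[X^*_p]$, it follows that

$$\eta = (1 - \alpha) \left\{\operatorname{Tr}\left[\left(\alpha I + (1 - \alpha) A D^{-1}\right)^{q-1}\right]\right\}^{1-p}$$

and thus that $\eta\lambda = \left\{\operatorname{Tr}\left[\left(\alpha I + (1 - \alpha) A D^{-1}\right)^{q-1}\right]\right\}^{1-p}$. Since $W_\alpha = \alpha I + (1 - \alpha) A D^{-1}$, the
lemma follows.

□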



Conversely, given a graph $G$, a laziness parameter $\alpha$, and a number of steps $q' = q - 1$, the Truncated
Lazy Random Walk of Equation (4) can be characterized as the solution to the regularized
$(F_p, \eta)$-SDP, with the regularization parameter $\eta$ as specified in the proof.
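The same kind of numerical check applies to the truncated lazy random walk. The sketch below is an illustration under stated assumptions: it works on all of $\mathbb{R}^n$ instead of the subspace orthogonal to $D^{1/2}\mathbf{1}$; it takes $q$ to be an integer so that $q - 1$ is a number of walk steps; and it picks a laziness parameter of at least 1/2, an assumption made here so that $W_\alpha$ has nonnegative eigenvalues. The graph and the function name `lazy_walk_sdp_solution` are hypothetical.

```python
import numpy as np

def lazy_walk_sdp_solution(A, alpha, q):
    """Return (X, eta, lam) built from W_alpha^(q-1), following the proof of Lemma 3."""
    n = A.shape[0]
    d = A.sum(axis=0)
    W_alpha = alpha * np.eye(n) + (1.0 - alpha) * A @ np.diag(1.0 / d)
    W_pow = np.linalg.matrix_power(W_alpha, q - 1)    # q - 1 steps of the lazy walk
    p = q / (q - 1.0)                                 # conjugate exponent: 1/p + 1/q = 1
    eta_lam = np.trace(W_pow) ** (1.0 - p)            # eta * lam, from Tr[X] = 1
    eta = (1.0 - alpha) * eta_lam
    lam = eta_lam / eta
    X = np.diag(d ** -0.5) @ W_pow @ np.diag(d ** 0.5) / np.trace(W_pow)
    return X, eta, lam

# Hypothetical example graph: a path on four nodes.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
d = A.sum(axis=0)
L = np.eye(4) - A / np.sqrt(np.outer(d, d))

alpha, q = 0.6, 4
X, eta, lam = lazy_walk_sdp_solution(A, alpha, q)

# Equation (10): X = (eta (lam I - L))^(q-1).
X_from_sdp = np.linalg.matrix_power(eta * (lam * np.eye(4) - L), q - 1)
print(eta, lam, (lam - 1.0) / lam)
```

The last printed value recovers the laziness parameter from the Lagrangian parameter, matching the substitution used in the proof.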



4 Discussion and Conclusion 

There is a large body of empirical and theoretical work with a broadly similar flavor to ours. 
Here, we provide just a few citations that most informed our approach. 

• In machine learning, Belkin, Niyogi, and Sindhwani describe a geometrically-motivated
framework within which semi-supervised learning algorithms can be constructed [3]; Saul
and Roweis (and many others, but less explicitly) observe that adding a regularization term
to improve numerical properties also "acts to penalize large weights that exploit correlations
beyond some level of precision in the data sampling process" [18]; Rosasco, De Vito,
and Verri describe how a large class of regularization methods designed for solving ill-posed
inverse problems gives rise to novel learning algorithms [17]; Zhang and Yu show that in
boosting, early stopping (as opposed to waiting for full convergence) leads to regularization
and hence better prediction [23]; Shi and Yu describe statistical aspects of binning in
Gaussian kernel regularization [20]; and Bishop observes that training with noise can be
equivalent to Tikhonov regularization [5].
• In numerical linear algebra, O'Leary, Stewart, and Vandergraft have described issues that
arise in estimating the largest eigenvalue of a positive definite matrix with the power
method [15]; and Parlett, Simon, and Stringer have described convergence issues that arise
when estimating the largest eigenvalue with an iterative method [16].

• In the theory of algorithms, Spielman and Teng describe how to perform local graph partitioning
using truncated random walks [21]; Andersen, Chung, and Lang describe an improved
local graph partitioning algorithm using PageRank vectors [1]; and Chung describes how to
perform similar operations by using the heat kernel and viewing it as the so-called pagerank
of a graph [8].

• In internet data analysis, Andersen and Lang use these methods to try to find communities in
large networks [2]; Leskovec, Lang, and Mahoney use these and other methods to show that
there do not exist good large communities in large social and information networks [11, 12];
and Lu et al. empirically evaluate implicit regularization constraints for improved online
review quality prediction [13].

None of this work, however, takes the approach we have adopted of asking: What is the regularized 
optimization objective that a heuristic or approximation algorithm is exactly optimizing? 

We should note that one can interpret our main results from one of two alternate perspectives.
From the perspective of worst-case analysis, we provide a simple characterization of several related
methods for approximating the smallest nontrivial eigenvector of a graph Laplacian as solving a
related optimization problem. By adopting this view, it should perhaps be less surprising that
these methods have Cheeger-like inequalities, with related algorithmic consequences, associated
with them [21, 1, 8, 7]. From a statistical perspective, one could imagine one method or another
being more or less appropriate as a method to compute robust approximations to the smallest
nontrivial eigenvector of a graph Laplacian, depending on assumptions being made about the data.
By adopting this view, it should perhaps be less surprising that these methods have performed
well at identifying structure in sparse and noisy networks [2, 11, 12, 13].

The particular results that motivated us to ask this question had to do with recent empirical
work on characterizing the clustering and community structure in very large social and information
networks [11, 12]. As a part of that line of work, Leskovec, Lang, and Mahoney (LLM) [12]
were interested in understanding the artifactual properties induced in output clusters as a function
of different approximation algorithms for a given objective function (that formalized the
community concept). LLM observed a severe tradeoff between the objective function value and
the "niceness" of the clusters returned by different approximation algorithms. This phenomenon
is analogous to the bias-variance tradeoff that is commonly observed in statistics and machine
learning, except that LLM did not perform any explicit regularization; instead, they observed
this phenomenon as a function of different approximation algorithms to compute approximate
solutions to the intractable graph partitioning problem.

Although we have focused in this paper simply on the problem of computing an eigenvector, 
one is typically interested in computing eigenvectors in order to perform some downstream data 
analysis or machine learning task. For instance, one might be interested in characterizing the 
clustering properties of the data. Alternatively, the goal might be to perform classification or 
regression or ranking. It would, of course, be of interest to understand how the concept of 
implicit regularization via approximate computation extends to the output of algorithms for these 
problems. More generally, though, it would be of interest to understand how this concept of 
implicit regularization via approximate computation extends to intractable graph optimization 
problems (that are not obviously formulatable as vector space problems) that are more popular 
in computer science. That is: What is the (perhaps implicitly regularized) optimization problem 
that an approximation algorithm for an intractable optimization problem is implicitly optimizing?
Such graph problems arise in many applications, but the formulation and solution of these
graph problems tends to be quite different from that of matrix problems that are more popular
in machine learning and statistics. Recent empirical and theoretical evidence, however, clearly 
suggests that regularization will be fruitful in this more general setting. 